
Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

We consider the widely-used average-linkage, single-linkage, and Ward's methods for computing hierarchical clusterings of high-dimensional Euclidean inputs. It is easy to show that there is no efficient implementation of these algorithms in high-dimensional Euclidean space, since each step implicitly requires solving the closest pair problem, a notoriously difficult problem. However, how fast can these algorithms be implemented if we allow approximation? More precisely: these algorithms successively merge the two clusters that are closest under the linkage cost, namely the average pairwise distance (for average-linkage), the minimum pairwise distance (for single-linkage), or the least increase in sum-of-squared error (for Ward's). We ask whether one could obtain a significant running-time improvement if the algorithm is allowed to merge $\gamma$-approximate closest clusters, i.e., clusters whose linkage cost (average distance, minimum distance, or sum-of-squared-error increase) is at most $\gamma$ times that of the closest pair of clusters. We show that one can indeed take advantage of this relaxation and compute the approximate hierarchical clustering tree using $\widetilde{O}(n)$ $\gamma$-approximate nearest neighbor queries.
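To make the three linkage costs concrete, here is a minimal, deliberately naive Python sketch of exact agglomerative merging. This is an illustration only, not the paper's subquadratic algorithm: all function names are hypothetical, and the exact minimum over all cluster pairs is precisely the expensive step the paper replaces with approximate nearest neighbor queries.

```python
import itertools
import math

def avg_linkage(a, b):
    # Average linkage: mean pairwise Euclidean distance between clusters.
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def single_linkage(a, b):
    # Single linkage: minimum pairwise Euclidean distance.
    return min(math.dist(p, q) for p in a for q in b)

def wards_cost(a, b):
    # Ward's method: increase in sum of squared errors caused by merging.
    def sse(c):
        dim = len(c[0])
        mean = [sum(p[i] for p in c) / len(c) for i in range(dim)]
        return sum(math.dist(p, mean) ** 2 for p in c)
    return sse(a + b) - sse(a) - sse(b)

def naive_hac(points, cost):
    # Exact agglomerative clustering: repeatedly merge the pair of
    # clusters with minimum linkage cost; returns the merge sequence.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cost(clusters[ij[0]], clusters[ij[1]]))
        merges.append((tuple(map(tuple, clusters[i])),
                       tuple(map(tuple, clusters[j]))))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
merges = naive_hac(pts, single_linkage)
# The two nearby pairs are merged before the final cross-pair merge.
```

Each round scans all cluster pairs, so this exact version runs in at least quadratic time; the paper's point is that the $\gamma$-approximate relaxation avoids exactly this scan.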


Reviews: Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

This paper proposes a new approach to approximating hierarchical agglomerative clustering (HAC) by requiring that, at each round, only a gamma-best merge be performed (gamma being the multiplicative approximation factor relative to the closest pair). Two algorithms are introduced to approximate HAC: one for Ward's method and one for average linkage. In both cases, the algorithms use approximate nearest neighbor (ANN) search as a black box. In addition, a bucketing data structure is used in the Ward's algorithm, and a subsampling procedure is used for average linkage, to guarantee the subquadratic runtime. This is a new contribution to the theoretical literature on HAC: a provably subquadratic algorithm for (an approximation to) HAC in cases other than single linkage.
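The gamma-best merge rule the review describes can be sketched as a single merge step. This is a hypothetical illustration, not the paper's algorithm: the exact linear scan below stands in for the ANN black box, centroid distance stands in for the actual linkage costs, and the bucketing and subsampling machinery is not reproduced.

```python
import math

def approx_merge_step(clusters, gamma):
    # Each cluster is (centroid, size); merging averages centroids.
    # Centroid distance is a simplified stand-in for the linkage cost.
    d = lambda a, b: math.dist(a[0], b[0])
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    d_min = min(d(clusters[i], clusters[j]) for i, j in pairs)
    # gamma-approximate rule: merging ANY pair within gamma * d_min of
    # the optimum is allowed, so an ANN query can supply such a pair
    # without the exact all-pairs scan performed here for illustration.
    i, j = next((i, j) for i, j in pairs
                if d(clusters[i], clusters[j]) <= gamma * d_min)
    (ca, na), (cb, nb) = clusters[i], clusters[j]
    merged = ([(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)],
              na + nb)
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

clusters = [([0.0, 0.0], 1), ([0.2, 0.0], 1), ([4.0, 4.0], 1)]
clusters = approx_merge_step(clusters, gamma=2.0)
```

The relaxation is what makes subquadratic time plausible: any pair returned by a gamma-approximate nearest neighbor query already satisfies the merge condition, so no exact closest-pair computation is needed.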


Reviews: Subquadratic High-Dimensional Hierarchical Clustering

Neural Information Processing Systems

This paper was very much a borderline paper, with two accept scores and one reject score. One concern raised by the negative reviewer was that, while the algorithm achieves an approximation to the best merge at each step, it is unclear how the final clustering results compare to those of the standard algorithm. The authors addressed this in their rebuttal, which helped. There were also some issues raised about the experiments, as well as various minor suggestions (typos etc.). Overall the concerns are mostly minor, and the paper makes an interesting and worthwhile contribution, so I am recommending that it be accepted.



Subquadratic High-Dimensional Hierarchical Clustering

Abboud, Amir, Cohen-Addad, Vincent, Houdrouge, Hussein

Neural Information Processing Systems